## [1] 67
## [1] 93
## Source: local data frame [3 x 2]
##
## state mean(map)
## (chr) (dbl)
## 1 NTC 0.3383333
## 2 Single-cell 30.3183333
## 3 Ten-cell 40.6050000
## Source: local data frame [3 x 2]
##
## state mean(conversion_rate)
## (chr) (dbl)
## 1 NTC 95.71250
## 2 Single-cell 97.37414
## 3 Ten-cell 96.95000
## Source: local data frame [3 x 2]
##
## state mean(cpg_count)
## (chr) (dbl)
## 1 NTC 3298.167
## 2 Single-cell 401970.006
## 3 Ten-cell 2328151.333
## [1] "Loading CNV Data"
## [1] 62
## [1] 86
## <environment: R_GlobalEnv>
## Joining by: "id"
## Joining by: "cell"
Calling CpG-resolution heterogeneity
SC pairwise heterogeneity - Making comparisons and calculating heterogeneity
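A minimal sketch of what a pairwise comparison like this could look like, assuming heterogeneity between two cells is the fraction of discordant binary methylation calls among CpGs covered in both cells. The metric, the helper name `pairwise_heterogeneity`, and the toy matrix are all assumptions for illustration, not the code that produced the results in this notebook.

```r
# Rows are CpGs, columns are cells; 1 = methylated, 0 = unmethylated,
# NA = CpG not covered in that cell (hypothetical toy data).
meth <- cbind(
  cellA = c(1, 0, 1, NA, 1),
  cellB = c(1, 1, 1, 0, NA),
  cellC = c(0, 0, NA, 0, 1)
)

# Hypothetical helper: fraction of discordant calls at CpGs covered in both cells.
pairwise_heterogeneity <- function(m, min_shared = 1) {
  cells <- colnames(m)
  out <- matrix(NA_real_, length(cells), length(cells),
                dimnames = list(cells, cells))
  for (i in seq_along(cells)) {
    for (j in seq_along(cells)) {
      shared <- !is.na(m[, i]) & !is.na(m[, j])
      # Pairs with too few shared CpGs stay NA.
      if (sum(shared) >= min_shared) {
        out[i, j] <- mean(m[shared, i] != m[shared, j])
      }
    }
  }
  out
}

het <- pairwise_heterogeneity(meth)
het
```

The upper triangle of `het` is what would feed a heterogeneity distribution plot; raising `min_shared` excludes pairs with too little overlapping coverage.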
Plot heterogeneity distribution
Removed outliers ESLAM-GAGTGG and ESLAM-TTAGGC
## Source: local data frame [2 x 2]
##
## cluster n
## (int) (int)
## 1 1 97
## 2 2 49
## [1] "13 LSK in cluster | 62 Total LSK"
## [1] "84 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 44
## 2 2 15
## 3 3 25
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "25 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 44
## 2 2 15
## 3 3 25
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "44 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 56
## 2 2 13
## 3 3 77
## [1] "18 LSK in cluster | 62 Total LSK"
## [1] "38 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 27
## 2 2 33
## 3 3 86
## [1] "12 LSK in cluster | 62 Total LSK"
## [1] "21 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [6 x 2]
##
## cluster n
## (int) (int)
## 1 1 46
## 2 2 33
## 3 3 32
## 4 4 28
## 5 5 6
## 6 6 1
## [1] "10 LSK in cluster | 62 Total LSK"
## [1] "18 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [6 x 2]
##
## cluster n
## (int) (int)
## 1 1 46
## 2 2 33
## 3 3 32
## 4 4 28
## 5 5 6
## 6 6 1
## [1] "18 LSK in cluster | 62 Total LSK"
## [1] "14 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [4 x 2]
##
## cluster n
## (int) (int)
## 1 1 57
## 2 2 40
## 3 3 48
## 4 4 1
## [1] "19 LSK in cluster | 62 Total LSK"
## [1] "21 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [6 x 2]
##
## cluster n
## (int) (int)
## 1 1 45
## 2 2 33
## 3 3 34
## 4 4 31
## 5 5 2
## 6 6 1
## [1] "13 LSK in cluster | 62 Total LSK"
## [1] "20 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [6 x 2]
##
## cluster n
## (int) (int)
## 1 1 45
## 2 2 33
## 3 3 34
## 4 4 31
## 5 5 2
## 6 6 1
## [1] "20 LSK in cluster | 62 Total LSK"
## [1] "25 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 53
## 2 2 92
## 3 3 1
## [1] "35 LSK in cluster | 62 Total LSK"
## [1] "57 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [5 x 2]
##
## cluster n
## (int) (int)
## 1 1 52
## 2 2 55
## 3 3 27
## 4 4 11
## 5 5 1
## [1] "6 LSK in cluster | 62 Total LSK"
## [1] "21 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [5 x 2]
##
## cluster n
## (int) (int)
## 1 1 52
## 2 2 55
## 3 3 27
## 4 4 11
## 5 5 1
## [1] "25 LSK in cluster | 62 Total LSK"
## [1] "30 ESLAM in cluster | 84 Total ESLAM"
## Source: local data frame [4 x 2]
##
## cluster n
## (int) (int)
## 1 1 122
## 2 2 10
## 3 3 9
## 4 4 5
## [1] "55 LSK in cluster | 62 Total LSK"
## [1] "67 ESLAM in cluster | 84 Total ESLAM"
## [1] 1
Define subcluster of purity as cells that fall into clusters
Grab aggregated CpG calls per cell population and group cells belonging to the same “cluster” together in silico using BSmooth
Read 21867550 rows and 3 (of 3) columns from 0.486 GB file in 00:00:06
## Error in `[.data.table`(.SD, , 3:12, with = FALSE): j out of bounds
## [BSmooth] preprocessing ... done in 15.9 sec
## [BSmooth] smoothing by 'chromosome' (mc.cores = 20, mc.preschedule = FALSE)
## [BSmooth] smoothing done in 930.0 sec
Find DMRs from in silico bulk groups
Call DMRs
Criteria: differentially methylated CpGs (dmCpGs) are called if their methylation difference between the two groups passes a statistical cutoff.
## [1] -0.2703689 0.2749154
Then, nearby dmCpGs are grouped together into DMRs if they are within some distance of each other.
## [1] 15617
## Joining by: c("chr", "bin", "start", "end")
There are 14104 DMRs between pure and unpure
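The two steps described above (call dmCpGs by a cutoff, then merge nearby dmCpGs into DMRs) can be sketched as follows. This is an illustration under stated assumptions: the cutoffs, the 300 bp maximum gap, the rule that a region is split when the direction flips, and the helper name `call_dmrs` are hypothetical and not taken from the analysis.

```r
# Hypothetical per-CpG table for one chromosome: position and smoothed
# methylation difference (delta) between the two groups.
cpg <- data.frame(
  pos   = c(100, 150, 180, 900, 950, 5000, 5100, 5150),
  delta = c(0.40, 0.35, 0.31, 0.05, -0.45, -0.33, -0.38, -0.29)
)

# Cutoffs and max_gap are illustrative assumptions.
call_dmrs <- function(cpg, lower = -0.27, upper = 0.27, max_gap = 300) {
  # Step 1: keep CpGs whose difference falls outside the cutoffs.
  dm <- cpg[cpg$delta <= lower | cpg$delta >= upper, ]
  if (nrow(dm) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0),
                      n_cpg = numeric(0), direction = character(0),
                      stringsAsFactors = FALSE))
  }
  dm <- dm[order(dm$pos), ]
  direction <- ifelse(dm$delta > 0, "hyper", "hypo")
  # Step 2: start a new region when the gap is too large or the sign flips.
  new_region <- c(TRUE, diff(dm$pos) > max_gap |
                    direction[-1] != direction[-length(direction)])
  region <- cumsum(new_region)
  data.frame(
    start     = as.numeric(tapply(dm$pos, region, min)),
    end       = as.numeric(tapply(dm$pos, region, max)),
    n_cpg     = as.numeric(tapply(dm$pos, region, length)),
    direction = as.character(tapply(direction, region, function(x) x[1])),
    stringsAsFactors = FALSE
  )
}

dmrs <- call_dmrs(cpg)
dmrs
```

A real run would apply this per chromosome and would probably also filter regions on a minimum number of CpGs.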
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Defining activating and repressing regions as follows:
Negatively correlating regions (less methylation = more expression)
## [1] "3009 of 14104 DMRs lie in regions of epigenetic potential"
## Source: local data frame [3 x 2]
##
## Annotation n
## (chr) (int)
## 1 exon1 66
## 2 Intergenic 1689
## 3 promoter-TSS 1254
## [1] "124 genes that associate with both activated and repressed regions, out of 2382 total associated genes, are discarded (along with their regions)"
## Source: local data frame [2 x 2]
##
## state n
## (chr) (int)
## 1 hypo_in_pure 1301
## 2 hypo_in_unpure 1396
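The filter reported above, where genes associated with both activated and repressed regions are discarded along with their regions, could look roughly like this. The table layout and gene names are hypothetical; only the `hypo_in_pure` / `hypo_in_unpure` labels come from the output.

```r
# Hypothetical region-to-gene annotation; `state` says in which group the
# region is hypomethylated.
regions <- data.frame(
  region = c("r1", "r2", "r3", "r4", "r5"),
  gene   = c("GeneA", "GeneA", "GeneB", "GeneC", "GeneC"),
  state  = c("hypo_in_pure", "hypo_in_unpure", "hypo_in_pure",
             "hypo_in_unpure", "hypo_in_unpure"),
  stringsAsFactors = FALSE
)

# Count distinct states per gene; genes with more than one state are ambiguous.
states_per_gene <- tapply(regions$state, regions$gene,
                          function(s) length(unique(s)))
ambiguous <- names(states_per_gene)[states_per_gene > 1]

# Drop ambiguous genes together with all of their regions.
kept <- regions[!regions$gene %in% ambiguous, ]
table(kept$state)
```

Here `GeneA` is dropped because it has one region of each state, while `GeneC` is kept because both of its regions agree.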
## [plotManyRegions] preprocessing ...done
## [plotManyRegions] plotting region 1 (out of 5)
## [plotManyRegions] plotting region 2 (out of 5)
## [plotManyRegions] plotting region 3 (out of 5)
## [plotManyRegions] plotting region 4 (out of 5)
## [plotManyRegions] plotting region 5 (out of 5)
## png
## 2
## [plotManyRegions] preprocessing ...done
## [plotManyRegions] plotting region 1 (out of 5)
## [plotManyRegions] plotting region 2 (out of 5)
## [plotManyRegions] plotting region 3 (out of 5)
## [plotManyRegions] plotting region 4 (out of 5)
## [plotManyRegions] plotting region 5 (out of 5)
## png
## 2
Criteria: differentially methylated CpGs (dmCpGs) are called if their methylation difference between the two groups passes a statistical cutoff.
## [1] -0.2519766 0.2411145
## Joining by: c("chr", "bin", "start", "end")
## [1] 21583
## [1] "4284 of 21583 LSK/ESLAM DMRs lie in regions of epigenetic potential"
## Source: local data frame [3 x 2]
##
## Annotation n
## (chr) (int)
## 1 exon1 85
## 2 Intergenic 2568
## 3 promoter-TSS 1631
## [1] "237 genes that associate with both LSK and ESLAM regions, out of 3109 total associated genes, are discarded (along with their regions)"
## Source: local data frame [2 x 2]
##
## state n
## (chr) (int)
## 1 hypo_in_eslam 1772
## 2 hypo_in_lsk 1901
pure and unpure
lsk and eslam
“First, we estimated whether each transcription factor was expressed in each cell type. We used RNA-seq measurements, defining ‘expressed’ as having more than 5 reads per million tags, and higher than 20% of its maximal expression levels across all cells.”
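The quoted definition of "expressed" translates almost directly into code. A small sketch, assuming a matrix of reads per million with transcription factors in rows and cell types in columns; the matrix values and the helper name `is_expressed` are made up for illustration.

```r
# Hypothetical expression matrix in reads per million
# (rows = transcription factors, columns = cell types).
rpm <- rbind(
  TF1 = c(HSC = 50,  MPP = 8,  GMP = 2),
  TF2 = c(HSC = 4,   MPP = 3,  GMP = 1),
  TF3 = c(HSC = 100, MPP = 30, GMP = 6)
)

# "Expressed" = more than 5 RPM AND more than 20% of the gene's maximum
# across all cell types, following the quoted definition.
is_expressed <- function(rpm, min_rpm = 5, min_frac_of_max = 0.2) {
  row_max <- apply(rpm, 1, max)
  # row_max has one entry per row, so it recycles down each column.
  rpm > min_rpm & rpm > min_frac_of_max * row_max
}

expressed <- is_expressed(rpm)
expressed
```

Note that `TF1` in `MPP` passes the 5 RPM cutoff but fails the 20%-of-maximum rule, which is the case the second criterion exists for.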
DNMT3a and 3b double-KO paper
http://www.cell.com/cell-stem-cell/abstract/S1934-5909(14)00266-5
Single-cell HSC transcriptome
http://link.springer.com/article/10.1186/s13059-015-0739-5/fulltext.html
Single-cell RNA-seq of HSCs reveals transcriptional heterogeneity largely explained by cell-cycle-related genes
Aging-related genes
Pure = cluster 1 = the “close together” one = Orange
Unpure = cluster 2 = the more “spread out” one = Green
Using GREAT tool with settings Basal+extension (constitutive 5 kb upstream and 1 kb downstream, up to 20 kb max extension)
## Don't make too frequent requests. The time break is 10s.
## Please wait for 2s for the next request.
## The time break can be set by `request_interval` argument.
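For reference, the basal part of the Basal+extension rule (5 kb upstream and 1 kb downstream of the TSS) can be sketched as below. This only illustrates the strand-aware basal domain; the extension step is not implemented, and the gene table and the helper name `basal_domain` are hypothetical.

```r
# Hypothetical TSS table (coordinates are made up for illustration).
genes <- data.frame(
  gene   = c("GeneA", "GeneB"),
  tss    = c(100000, 200000),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)

# Basal domain: `upstream` bp upstream and `downstream` bp downstream of the
# TSS, where "upstream" means lower coordinates on "+" and higher on "-".
basal_domain <- function(genes, upstream = 5000, downstream = 1000) {
  plus <- genes$strand == "+"
  data.frame(
    gene  = genes$gene,
    start = ifelse(plus, genes$tss - upstream, genes$tss - downstream),
    end   = ifelse(plus, genes$tss + downstream, genes$tss + upstream),
    stringsAsFactors = FALSE
  )
}

basal <- basal_domain(genes)
basal
```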
Some terms popped up where promoter motifs were enriched for certain TFs. Check the methylation around the promoter of each such TF to guesstimate whether its expression differs between groups.
SuMO-specific genes in pure vs differentiated cells
SuMO cells are ~60% pure
From table S4, top positively and negatively correlated genes (http://www.sciencedirect.com/science/article/pii/S1934590915001629)
(genomic coordinates obtained from ensTrans genome browser, mm10)
Heatmap of all “genes positively correlated in SuMO cells”, i.e. genes that are more highly expressed in SuMO cells than in the rest.
## Joining by: "V1"
| ID | HSC1_1 | HSC1_2 | HSC1_3 | HSC1_4 | HSC1_5 | HSC1_6 | HSC1_7 | HSC1_8 |
|---|---|---|---|---|---|---|---|---|
| 0610005C13Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610007P14Rik | 25 | 0 | 304 | 0 | 154 | 550 | 117 | 177 |
| 0610008F07Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610009B14Rik | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610009B22Rik | 0 | 0 | 0 | 8 | 83 | 521 | 236 | 57 |
| 0610009D07Rik | 241 | 0 | 66 | 60 | 356 | 173 | 385 | 64 |
Distribution of total read counts in single HSCs
Percentage of reads in mitochondrial genes
## [1] 92
## Joining by: "cell"
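The two per-cell QC quantities named above (total read counts and percentage of reads in mitochondrial genes) can be computed from a count matrix along these lines. This is a sketch on toy data; it assumes mitochondrial genes are identifiable by an `mt-` symbol prefix, which may not match the annotation actually used here.

```r
# Hypothetical count matrix (rows = genes, columns = cells).
counts <- rbind(
  "mt-Nd1" = c(cell1 = 10, cell2 = 50),
  "mt-Co1" = c(cell1 = 10, cell2 = 50),
  "Actb"   = c(cell1 = 60, cell2 = 60),
  "Gapdh"  = c(cell1 = 20, cell2 = 40)
)

# Total reads per cell.
total_reads <- colSums(counts)

# Assumption: mitochondrial gene symbols start with "mt-".
is_mito <- grepl("^mt-", rownames(counts))
pct_mito <- 100 * colSums(counts[is_mito, , drop = FALSE]) / total_reads

data.frame(total_reads, pct_mito)
```

Cells with a very low `total_reads` or a very high `pct_mito` are the usual candidates for removal before clustering.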
## initial value 29.218440
## iter 5 value 24.917247
## iter 10 value 24.072466
## final value 24.004633
## converged
## Error in tbl_vars(y): object 'sc_rna_sumo_mean' not found
The paper reports ~4500 genes with variation greater than technical variation
Instead, perform a one-sided test against the null hypothesis that the true variance is at most the technical variation
## [1] 0.7371249
##
## FALSE TRUE
## 20106 1025
We get 1025 DE genes, i.e. genes for which the null hypothesis (variation no greater than technical) is rejected
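One common way to set up a one-sided test of this kind is a chi-squared test on the sample variance. The sketch below is an assumption about the general form of the test, not the exact procedure used here: it assumes roughly normal expression values and a single known technical variance, and the data, `technical_var`, and the helper name `variance_test` are hypothetical.

```r
# Hypothetical normalized expression (rows = genes, columns = cells).
set.seed(1)
n_cells <- 92
expr <- rbind(
  stable   = rnorm(n_cells, mean = 100, sd = 5),
  variable = rnorm(n_cells, mean = 100, sd = 40)
)
technical_var <- 10^2  # assumed technical variance, for illustration only

# H0: true variance <= technical_var. Under H0 at the boundary,
# (n - 1) * s^2 / technical_var follows a chi-squared distribution
# with n - 1 degrees of freedom.
variance_test <- function(expr, technical_var) {
  n <- ncol(expr)
  s2 <- apply(expr, 1, var)
  stat <- (n - 1) * s2 / technical_var
  pchisq(stat, df = n - 1, lower.tail = FALSE)
}

pvals <- variance_test(expr, technical_var)
table(p.adjust(pvals, method = "BH") < 0.05)
```

In practice the technical variance usually depends on mean expression, so `technical_var` would be a per-gene vector rather than a constant.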
## [1] "44 DMR-associated genes not present in SC-RNAseq data"
| Quantity | Count |
|---|---|
| total_RNAseq_genes | 19997 |
| DMR_associated_genes | 469 |
| Differentially_Expressed | 894 |
| Overlap | 33 |
## [1] "The p-value of this overlap is 0.00703898464946137"
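An overlap p-value like the one above is typically a one-sided hypergeometric test. The sketch below plugs the counts from the table into `phyper`; whether this matches the test actually run is an assumption, so the result need not equal the reported value exactly.

```r
# Counts taken from the table above.
total_genes <- 19997  # total_RNAseq_genes
dmr_genes   <- 469    # DMR_associated_genes
de_genes    <- 894    # Differentially_Expressed
overlap     <- 33     # Overlap

# P(X >= overlap) when drawing `de_genes` genes at random from `total_genes`,
# of which `dmr_genes` are DMR-associated.
p_overlap <- phyper(overlap - 1, m = dmr_genes, n = total_genes - dmr_genes,
                    k = de_genes, lower.tail = FALSE)

# Overlap expected by chance, for comparison with the observed 33.
expected <- de_genes * dmr_genes / total_genes

c(expected = expected, p_value = p_overlap)
```

The `overlap - 1` is needed because `phyper(q, ..., lower.tail = FALSE)` returns P(X > q), not P(X >= q).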
## [1] "The probability that the distributions of expression of DMR genes and all genes are equal is 1.09055764392958e-11 using a Mann-Whitney non-parametric test."
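For reference, a Mann-Whitney comparison of two expression distributions can be run with `wilcox.test`. The values below are hypothetical toy data. Strictly speaking, the p-value is the probability of a rank difference at least this extreme if the two distributions were equal, not the probability that they are equal.

```r
# Hypothetical mean expression values (no ties, so the exact test is used).
dmr_gene_expr   <- c(6.1, 7.4, 8.2, 9.9, 10.3)
other_gene_expr <- c(1.1, 2.3, 3.2, 4.8, 5.5)

# Two-sided Mann-Whitney (Wilcoxon rank-sum) test.
mw <- wilcox.test(dmr_gene_expr, other_gene_expr, alternative = "two.sided")
mw$p.value
```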
## [1] "63 genes with 0 average expression across single cells"
1-spearman correlation for distance
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 49
## 2 2 35
## 3 3 8
## Source: local data frame [3 x 2]
##
## cluster n
## (int) (int)
## 1 1 49
## 2 2 35
## 3 3 8
## [1] "35 HSCs within cluster 1 out of 92 HSCs"
## [1] "49 HSCs within cluster 2 out of 92 HSCs"
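Clustering cells with "1 - Spearman correlation" as the distance can be sketched as follows. The simulated matrix, the average-linkage method, and the choice of `k = 2` are illustrative assumptions, not the settings that produced the cluster tables above.

```r
# Hypothetical expression matrix (rows = genes, columns = cells):
# two groups of five cells, each a noisy copy of a shared profile.
set.seed(42)
base1 <- rnorm(50)
base2 <- rnorm(50)
expr <- cbind(
  sapply(1:5, function(i) base1 + rnorm(50, sd = 0.1)),
  sapply(1:5, function(i) base2 + rnorm(50, sd = 0.1))
)
colnames(expr) <- paste0("cell", 1:10)

# Distance between cells = 1 - Spearman correlation of their profiles.
d <- as.dist(1 - cor(expr, method = "spearman"))

# Hierarchical clustering; linkage method and k are illustrative choices.
hc <- hclust(d, method = "average")
cluster <- cutree(hc, k = 2)
table(cluster)
```

Because Spearman correlation is rank-based, this distance is insensitive to monotone transformations of expression such as log scaling.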
1-spearman correlation for distance
## Source: local data frame [5 x 2]
##
## cluster n
## (int) (int)
## 1 1 41
## 2 2 34
## 3 3 2
## 4 4 11
## 5 5 4
## Source: local data frame [5 x 2]
##
## cluster n
## (int) (int)
## 1 1 41
## 2 2 34
## 3 3 2
## 4 4 11
## 5 5 4
## [1] "41 HSCs within cluster 1 out of 92 HSCs"
## [1] "34 HSCs within cluster 2 out of 92 HSCs"
brain genes
## Source: local data frame [4 x 2]
##
## cluster n
## (int) (int)
## 1 1 35
## 2 2 22
## 3 3 32
## 4 4 3
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"
## Source: local data frame [4 x 2]
##
## cluster n
## (int) (int)
## 1 1 35
## 2 2 22
## 3 3 32
## 4 4 3
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"
## [1] "22 HSCs within cluster 1 out of 92 HSCs"
## [1] "32 HSCs within cluster 2 out of 92 HSCs"
liver development genes
## Source: local data frame [4 x 2]
##
## cluster n
## (int) (int)
## 1 1 25
## 2 2 29
## 3 3 13
## 4 4 25
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"
## Source: local data frame [4 x 2]
##
## cluster n
## (int) (int)
## 1 1 25
## 2 2 29
## 3 3 13
## 4 4 25
## [1] "0 LSK in cluster | 0 Total LSK"
## [1] "0 ESLAM in cluster | 0 Total ESLAM"
## [1] "25 HSCs within cluster 1 out of 92 HSCs"
## [1] "25 HSCs within cluster 2 out of 92 HSCs"
edgeR)
## Design matrix not provided. Switch to the classic mode.
## Higher expressed in Pure Number of DE genes
## 1 FALSE 2
## 2 TRUE 51
## [1] "The p-value of the overlap between genes with biological variation and DE genes between groups is 0"
## Joining by: "gene"
## [1] "Genes that are DE between groups AND have promoter DMRS: Eif5a,P4hb,Eef1g"